The Data Farm: Science Week Presentation

Learning from Data

Neil D. Lawrence and the Sheffield Machine Learning Research Group

5th March 2014

This notebook has been made available as part of our Open Data Science agenda. If you want to read more about this agenda there is a position paper/blog post available on it here.

This session is about 'learning from data'. How do we take the information on the internet and make sense of it? The answer, as you might expect, is using computers and mathematics. Luckily we also have a suite of tools to help. The first tool is a way of programming in python that really facilitates interaction with data. It is known as the "IPython Notebook", or more recently as the "Jupyter Project".

Welcome to the IPython Notebook

The notebook is a great way of interacting with computers. In particular it allows me to integrate text descriptions, maths and code all together in the same place. For me, that's what my research is all about. I try to take concepts that people can describe, then I try to capture the essence of the concept in a mathematical model. Then I try and implement the model on a computer, often combining it with data, to try and do something fun, useful or, ideally, both.

For the Science Week lecture on "The Data Farm" we looked at recommender systems.

Recommender Systems

Do you watch Netflix? Have you ever rated a movie there? Do you buy books or electronics on Amazon? How about grocery shopping? All these companies want you to buy more, watch more or listen more. The best way of getting you to do that is by showing you more of what you like. But what do you like? What sort of person are you? Can the computer tell? It can certainly try! And it does so with a "Recommender System". Recommender systems are so important to Netflix that they offered a $1 million prize for improving theirs.

What is a Recommender System?

To understand you the computer has to turn you into a series of numbers (but don't worry, it's not painful). In the 1960s TV series, "The Prisoner", Patrick McGoohan's character, and his fellow prisoners and warders, were assigned a number. But their number didn't seem to reflect anything about their status (it couldn't even be used to distinguish between prisoner and warder). To perform computations about preferences we need to turn users into a series of numbers. How many numbers does it take to describe you? How many numbers does it take to summarize anyone's tastes in movies, music or groceries? These are open questions that can be difficult to answer. It depends how much we want to know about you: if it's just your taste in film and clothes that's one thing. If we want to summarize your state of health, that's another.

Apparently a three year old's brain has $10^{15}$ synapses in it: a synapse governs the connection between neurons. It's quite possible that each synapse needs a number to represent it. So in the end it takes a lot of numbers to represent just the brain of a human, and that's before we start talking about what's going on in each of your cells.

We won't try to represent all this; instead we will perform a 'compression' on you. We will try to represent your opinions about films with just a few numbers. We are going to work with movies, and just for fun, we thought we'd get you to rate some movies and try and build a recommendation system.

Note that python, like many other programming languages, has a lot of software libraries for doing different things. Here we use three separate libraries (which we also load into the computer with the import commands) to perform three different jobs. And the software we downloaded will also be available in the form of ... you guessed it ... a library. In this case a library that we've written in Sheffield called pods. In the next section we load in the library and use it to download some data for analysis.

Note that the next step won't work unless you've run the code for the previous step! That's because when running code in the notebook, the python kernel (which is the software in the computer that's running the python code) has a state. By state we mean the values of the different variables in the system. The values of these variables can have a lot of effects, like which program code is in memory, and therefore accessible to the computer. By running commands in this notebook, we change the state of the computer all the time. The software below relies on the software above being run to work. The notebook lets you know which box was run when by placing a number beside the box. After the code's been run you should see a number like:

In [3]:

if it was the third box to be run. Or In [2]: if it was the second (etc.).

Whilst the notebook is waiting for the kernel to run commands, you will see In [*]: as the prompt. If this happens for too long, it may be that you've asked the computer to do something too complicated and it's getting slow. You can interrupt the kernel by selecting Kernel->Interrupt from the menu in the browser window above. If you get really confused about what's going on and want to wipe the state of the kernel clean, then you can select Kernel->Restart. That will reset the kernel's state.


In [1]:
%matplotlib inline
import pods # Our software for python open data science (pods)
import pandas as pd # the pandas library for data analysis
import numpy as np
import matplotlib.pyplot as plt


d = pods.datasets.movie_body_count()
movies = d['Y']

Recommendation in Numbers

I said that we need to reduce everything to numbers for recommendations on the computer. Let's think about what those numbers could be. A recommender system aims to make suggestions for items (films, books, music, groceries) given what it knows about users' tastes. The recommendation engine needs to represent the taste of all the users and the characteristics of each object. That is what our numbers should do.

One way of organizing objects is to place related things close together. You can spend a lot of time doing this when you are supposed to be revising. It's called tidying up. Tidying up can be quite important, for example in a library we try and put books that are on related topics near to each other on the shelves. One system for doing this is known as Dewey Decimal Classification. In the Dewey Decimal Classification system (which dates from 1876) each subject is given a number (in fact it's a decimal number---no surprises there!). For example, the fields of Sciences and Mathematics are all given numbers starting in the 500s. Because computers weren't around in 1876 (well a man called Charles Babbage was toying with the idea ... and a lady called Ada Lovelace was even thinking about programming languages but no one had actually built a computer) ... anyway as I was saying, because computers weren't around in 1876 they didn't get a number so they ended up being given numbers starting with 004. For example, works on the 'mathematical principles' of Computer science are given the series 004.0151 (which we might normally write as 4.0151, but when you think about it we could equally write 004.0151). Whilst it's a classification system for splitting books into groups (this is sometimes called a taxonomy) the books in the system are also normally laid out in the same order as the numbers. Ah ... back to the point: numbers.

So in the Dewey Decimal system we might now expect that neighbouring numbers represent books that are related to each other in subject. That seems to be exactly what we want for our recommender system! Could we somehow represent each film's subject according to a number?

The problem here is that we are representing it with only one number, a so called one dimensional representation of the system. Actually a one dimensional representation of a subject can be very awkward. To see this, let's have another think about the Dewey Decimal system. However, this time we can think about subjects which have numbers in the 900s. By the way, I'm not an expert on the Dewey Decimal system, I just read the Wikipedia page on it and then found a list of the numbers from Nova Southeastern University ... I may be a bit geeky, but memorizing the Dewey Decimal numbers is a step too far for even me (and don't memorize $\pi$ either, memorize $\tau$ instead ...).

Anyway, back to those Dewey decimals ... if we look at the list for the 900s from the link above, we see that whilst the ordering for places is somewhat sensible, it is also rather arbitrary. We have Europe listed from 940-949, Asia listed from 950-959 and Africa listed from 960-969. That seems OK, because Asia borders Europe, and Africa borders Asia (sorry I've slipped into a Geography lesson here ...). But it's also true that Africa is very close to Europe and the Carthaginians had an empire that went across both ... as did the Romans (a bit of History thrown in too).

A map of the Carthaginian Empire from Wikipedia. Hannibal famously took his African elephants from the purple bit, through Spain, and over the Alps to take the Romans by surprise in the red bit. It did surprise the Romans, but perhaps not as much as it surprised the elephants.

Where should those books go? If they went between Africa and Europe they'd end up in Asia ... which would be wrong. Ah! But that's easy to solve, let's just represent the subject by two numbers and then we can lay out the books like a map!

Inspired by the map example we can start considering how we might lay out films in two dimensions. Although we also would like to characterize people too ... maybe we could also make a map of people's interests. In fact maybe the map could be a map of the same space. Let's think of it as a film library ... then the people could sit near the films they were interested in. We would have two maps of the same place. One of where the films are, and one of where the people are. People could sit near the films they like, and then they wouldn't have to move far to get the next film they'd want to watch (don't take the analogy too far though, if you start thinking about toilets and snack bars it kind of breaks down ... but fortunately we've turned you into a number now so you don't need the toilet or a snack bar).

That's it! We've cracked it ... or at least we've come up with one way of doing it. Although now, the fun just starts. Once you've had the idea, now you get to do the maths.

Vectors

Do you know about vectors? Vectors are great. Vectors are just a way of carrying a series of numbers about together. Not around town and to Meadowhall of course (even numbers don't like going to Meadowhall), but mathematically. I can represent lots of things by vectors, anything I can represent by numbers in fact. So since people can be numbers, I can represent people by vectors. This is not such a strange thing, and in fact it is being done to you all the time. If you use Facebook, they represent you by a vector (or actually more than one vector ... but all those vectors can be combined into one vector at the end of the day ... have a think about why that is so). The vector that Facebook have determines which items will come to the top of your news feed. It determines what adverts you see. It even determines who your friends are ... well actually that's taking it a bit far ... it just determines who Facebook suggests to you for friends.

Of course it's fun to represent people by numbers, if only to freak them out a bit, but in reality they don't do it for that reason. The vector Facebook stores about you is pretty important because it determines what your experience of Facebook is. If it's a good experience, then you'll go there and interact more ... if it's a dull experience you'll wander off to WhatsApp or Instagram ... or maybe even Google+. If they get that vector right, they've captured something about your personality and they can create a better Facebook tailored for you.

When we write a vector down we do it like so: $$ \mathbf{v} = \begin{bmatrix}1\\2\end{bmatrix} $$ That is a special container for 2 numbers, the numbers themselves are 1 and 2. Sometimes we write the vector like this: $$ \mathbf{v} = \begin{bmatrix}v_1\\v_2\end{bmatrix} $$ where we are just saying the first number is given by $v_1$ and the second by $v_2$. Vectors are often used for speed (or more correctly velocity). So $v_1$ could be velocity in the Westerly direction and $v_2$ could be velocity in the Northerly direction. When people are piloting planes, they are often told they are being 'vectored'. It doesn't mean Facebook are turning them into a series of numbers (although Facebook probably are, that's just not what they mean on this occasion). It means that they are being told to go somewhere through a series of vectors.
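
If you want to try this in the notebook, here's a minimal sketch (using the numpy library we imported above) that stores the vector $\mathbf{v}$ from above as an array and pulls out its two elements.


In [ ]:
# a minimal sketch: store the vector v from above as a numpy array
v = np.array([1, 2])
print(v[0]) # the first element, v_1 (python counts from zero)
print(v[1]) # the second element, v_2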

I like $\mathbf{v}$ for vector so I'm going to use it to represent our films. Every film has its own vector. But there are lots of films, so to avoid confusing them we need to index our vectors. First we think that every film is given an index, $j$. Then the vector associated with the film is given by $\mathbf{v}_j$.

I'm going to do the same to you. And I'm going to use $\mathbf{u}$. I like to use $\mathbf{u}$ for you, and then index by $i$ for the users. So each user is represented by $\mathbf{u}$ and indexed by $i$ like so: $\mathbf{u}_i$.

So now that you are all $\mathbf{u}$s and the films are all $\mathbf{v}$s, we can think about how we pull together films and people.

If we have a movie rating, then we can represent it by $y$. That's easy, because it's just a number. But it's a number that is associated with a user and a film. That means it needs two indices: $i$ and $j$. One for the person and one for the film. So now we can represent all the ratings as $y_{i,j}$.

Dot Products

How do you multiply vectors together? One way is the dot product. If we take the dot product (or inner product) between two vectors then we simply take each of their corresponding elements, multiply them together and sum the result.

We now need a measure for determining the similarity between the item and the user: how close the user is sitting to the item in the room, if you like. We are going to use the inner product between the vector representing the item and the vector representing the user.

The dot product between two vectors $\mathbf{a}$ and $\mathbf{b}$ is written as $\mathbf{a}\cdot\mathbf{b}$. Mathematically we can write the sum as follows: $$ \mathbf{a}\cdot\mathbf{b} = \sum_{i} a_i b_i $$
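
To make that concrete, here's a small sketch in the notebook with two made-up vectors: we multiply the corresponding elements and sum, and check that numpy's built-in np.dot gives the same answer.


In [ ]:
# the dot product of two made-up vectors
a = np.array([1., 2.])
b = np.array([3., 4.])
print(np.sum(a*b))  # multiply corresponding elements and sum: 1*3 + 2*4 = 11
print(np.dot(a, b)) # numpy's built-in dot product gives the same answer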

What's that got to do with recommendation? Well, the inner product can be seen as a measure of similarity. How so? Well, one clever thing about the inner product that arises from trigonometry is that the inner product is related to the cosine of the angle between the two vectors.

The smaller the angle between two vectors the larger the inner product. $$ \mathbf{a}\cdot\mathbf{b} = |\mathbf{a}||\mathbf{b}| \cos(\theta) $$ where $\theta$ is the angle between two vectors and $|\mathbf{a}|$ and $|\mathbf{b}|$ are the respective lengths of the two vectors. This diagram from Wikipedia illustrates part of what's going on:

Here we have $A$ and $B$ as the vectors, and we can think of $A$ and $B$ as two different people (traditionally we call them Alice and Bob ... their nemesis is someone called Eve, but we don't need her today, so let's not go into that). The cosine is applied to the angle between the vectors (or people). Here the diagram illustrates that the cosine multiplied by the length of the vector gives the projection of one vector onto the other. This will be familiar to anyone who's had to work with 'resultant forces' in mathematics. Although in our case it's more like 'resultant personalities' or 'resultant tastes'. That would be the portion of Alice's tastes that is shared with Bob. We could construct a similar analogy for items, projecting one film or book onto another. Note that if the two items or users are in the same place ($A$=$B$), then for the purposes of our model they represent the same thing (or the same person).
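
If you'd like to check that relationship numerically, here's a small sketch (again with made-up vectors) that recovers the angle between two 'taste' vectors from their inner product and their lengths.


In [ ]:
# recover the angle between two made-up taste vectors from the inner product
a = np.array([1., 0.])
b = np.array([1., 1.])
cos_theta = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
print(np.arccos(cos_theta)) # the angle in radians, about 0.785 (45 degrees)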

The cosine is important in our model, so let's plot it using the notebook. Again note that it takes more lines to do the plotting than it takes to compute the cosine!


In [2]:
# plotting cosine
tau = 2*np.pi # tau is 2 times pi.

theta = np.linspace(-0.5*tau, 0.5*tau, 100) # this gives us 100 points in *radians* (in degrees they'd be from -180 to 180)
f = np.cos(theta)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(theta, f, 'r-')
plt.xlabel('$\\theta$')
plt.ylabel('$\\cos(\\theta)$')


Out[2]:
<matplotlib.text.Text at 0x7faae0fc8910>

We can immediately see that if the angle is small, cosine will be big. If it is large, cosine goes negative (at 180 degrees cosine is -1). So a negative inner product is associated with a wide angle and a large inner product is associated with a narrow angle. If the angle is wide then the two points described by the vectors will be far apart. If the angle is narrow then the two points will be nearer to each other (assuming the lengths of the two vectors doesn't vary much between the two inner products).

In our library we want each user to be sitting near the items they like (or rate highly) so we want the inner product to be large if the rating is high and small if the rating is low. We can do this by trying to force the inner product $\mathbf{u}_i^\top\mathbf{v}_j$ to be similar to the rating given by the user, $y_{i,j}$. To ensure this we will use a sum of squares objective function for all user ratings.


The Library as an Infinite Plane

Let's imagine a library which stores all the items we are interested in, not just books, but films and shopping items too. Such a library is likely to be very large, so we'll imagine it on an infinite two dimensional plane (if your head starts hurting, stop imagining the plane and definitely don't think of an infinite plane in four dimensions, because that's more painful).

To recap, this means we can use numbers to represent the location of each item on the plane. For a two dimensional plane, we need to store the locations in a vector of numbers: we can decide that the $j$th item's location in the library is given by $$ \mathbf{v}_j = \begin{bmatrix} v_{j,1} \\ v_{j,2}\end{bmatrix}, $$ where $v_{j,1}$ represents the $j$th item's location in the East-West direction (or the $x$-axis) and $v_{j,2}$ represents the $j$th item's location in the North-South direction (or the $y$-axis). Now we need to specify the location where each user sits so that all the items that interest them are nearby: we can also represent the $i$th user's location with a vector $$ \mathbf{u}_i = \begin{bmatrix} u_{i,1} \\ u_{i,2}\end{bmatrix}. $$ Finally, we need some way of recording a given user's affinity for a given item. This affinity might be the rating that the user gives the film. We can use $y_{i,j}$ to represent user $i$'s affinity for item $j$.

For our film example we might imagine wanting to order films in a few ways. We could imagine organising films in the North-South direction according to how romantic they are. We could place the more romantic films further North and the less romantic films further South. For the East-West direction we could imagine ordering them according to how historic they are: we can imagine placing science fiction films to the East and historical drama to the West. In this case, fans of historical romances would be based in the North-West location, whilst fans of Science Fiction Action films might be located in the South-East (if we assume that 'Action' is the opposite of 'Romance', which is not necessarily the case). How do we lay out all these films? Have we got the right axes? In machine learning the answer is to 'let the data speak'. Use the data to try and obtain such a layout. To do this we first need to obtain the data.


In [3]:
pods.datasets.data_path


Out[3]:
'/home/maxz/ods_data_cache/'

In [4]:
import os
# I have a python script on my computer that collates all your ratings. Then it writes the results to a dropbox directory.
# this line loads that result from the dropbox directory.
url = 'http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/dataset_mirror/data_farm/The_Data_Farm_Movie_Ratings.csv'
pods.util.download_url(url, store_directory = 'data_farm', 
                       save_name='The_Data_Farm_Movie_Ratings.csv')
movies = pd.read_csv(os.path.join('data_farm', 'The_Data_Farm_Movie_Ratings.csv'),
                     header=0, na_values=[-1, 'NaN']).set_index('userid')
# this line does 'set' subtraction (it converts the lists of columns into two sets, then subtracts the set that doesn't 
# include user names)
#user_names = list(set(movies.columns)-set(movies.columns[:9]))
movies.drop(labels=movies.columns[-4:], axis=1, inplace=True)
movies = movies.T
movies = movies.rename(columns={movies.columns[0]:'title'})
movies= movies.drop(movies.index[np.arange(14)])
movies=movies.rename(columns={'userid':'user_names'})
#movies.index = movies.index
#movies = movies.replace(-1, np.NaN)


The_Data_Farm_Movie_Ratings.csv
Downloading  http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/dataset_mirror/data_farm/The_Data_Farm_Movie_Ratings.csv -> ./data_farm/The_Data_Farm_Movie_Ratings.csv
[==============================]   0.021/0.021MB

Now we have the ratings you've all put in available in the new movies data frame, which includes your ratings (if you've made them).

Objective Functions and Optimisation

How do we get the computer to learn? My research area is known as 'machine learning', and it's one of the dominant areas of artificial intelligence at the moment. Broadly speaking, in machine learning we try to specify an objective function. The objective function is like a landscape in which we can move around. We move around by changing parameters. The objective function defines which bits of the landscape are good (ironically these are the low points) and which bits of the landscape are bad (these are the high points). This process is known as optimization (don't confuse it with optimism, where high points and low points play rather different roles).

We learn from data by trying to find points in this landscape that are low down. You may already be familiar with this idea through fitting a straight line, where the objective function is known as the sum of squared errors. This objective function was first proposed by Legendre in the early 19th century as a way of 'learning' the orbits of planets. The straight line is given by $$ y=mx+c $$ but it can't go through all the points. Instead you choose values for $m$ and $c$ that make the line go as close to the points as possible. How do you define close? One way is to measure the distance with the 'sum of squares', $$ e = \sum_i (y_i - m x_i - c)^2 $$ where the points you want the line to go through are given by $x_i$ and $y_i$.
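
To make the idea concrete, here's a small sketch (with made-up points) that computes this sum of squares for two candidate lines: the line that passes closer to the points gives the smaller error.


In [ ]:
# the sum of squared errors for a straight line, with made-up points
x = np.array([0., 1., 2., 3.])
y = np.array([0.1, 1.2, 1.9, 3.1]) # points lying roughly on the line y = x

def sum_of_squares(m, c, x, y):
    # the squared distance between the line m*x + c and the points, summed
    return np.sum((y - m*x - c)**2)

print(sum_of_squares(1.0, 0.0, x, y)) # a good guess for m and c gives a small error
print(sum_of_squares(0.0, 2.0, x, y)) # a poor guess gives a much larger error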

An Objective for the Recommender System

For our recommender system we also want to estimate the sum of square errors. But the error we are trying to minimize is given by $$ e = \sum_i \sum_j (y_{i,j} - \mathbf{u}_i\cdot\mathbf{v}_j)^2 $$ where the dot, $\cdot$, represents the inner product between the vectors. We want this error to be as small as possible, and we will reduce it by changing the values of each $\mathbf{u}_i$ and $\mathbf{v}_j$.

When we are done, if we want to predict your score for any film ($y_{i,j}$) then we just compute the inner product between $\mathbf{u}_i$ and $\mathbf{v}_j$.

Oh! But there's a problem! We don't have all the values for $y_{i,j}$ in our data! We only have the ones you rated. Actually, you can do a lot of clever maths to show that this doesn't matter and you don't really need to worry about those ones (you just throw those ones away). If someone asks why you can throw them away then just tell them it's because of Kolmogorov Consistency. Trust me, most people won't ask you again ... and if they do ask you again then you've definitely found someone interesting to talk to.

To take care of the 'throwing away' effect, we often add another variable $s_{i,j}$ which we define to be 1 if we have the rating and 0 otherwise. Then we can leave the sum across all values in and write the objective function as:

The total objective function can be written as $$ e = \sum_{i,j} s_{i,j} (y_{i,j} - \mathbf{u}_i^\top \mathbf{v}_j)^2 $$
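
Here's a minimal sketch of that objective on a tiny made-up example: two users, two items, and one missing rating. The matrix of $s_{i,j}$ values switches the missing entry off so it doesn't contribute to the error.


In [ ]:
# a tiny made-up example of the objective with a missing rating switched off
Y_toy = np.array([[5., 1.],
                  [0., 4.]])  # the 0. stands in for a missing rating, not a real score
S_toy = np.array([[1., 1.],
                  [0., 1.]])  # s_ij is 1 where we have a rating, 0 otherwise
U_toy = np.random.normal(size=(2, 2))*0.001  # one row per user
V_toy = np.random.normal(size=(2, 2))*0.001  # one row per item
predictions = np.dot(U_toy, V_toy.T)  # entry i,j is the inner product u_i . v_j
print(np.sum(S_toy*(Y_toy - predictions)**2))  # the masked sum of squared errors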

Objective Optimization

We need to minimize the sum of squared errors: we want the 'gap' between our prediction and the truth to be as small as possible. Today we are going to do that via gradient descent or steepest descent.

What is this algorithm? Well, we can again use the analogy of the objective function as a hilly surface, one that you want to get to the bottom of. Let's say you are at the top of Kinder and you want to get to Edale station. One way of getting there would be to just walk downhill in the steepest direction. Then if Edale station is at the bottom of the valley (which it isn't quite) and as long as you don't get stuck in one of the many hollows around the place (which you would) eventually you'd arrive at Edale station (don't try this, even if you didn't get stuck in a hollow, you'd likely fall off a cliff and I'd get in trouble). OK, but you get the idea. It turns out that we can use the technique of differentiation to compute the gradient of the objective function. The gradient is also a vector. And the gradient descent algorithm involves simply taking your current vector and subtracting off the gradient. It's rather like what a ball would do if it was pushed off Kinder Scout (well, actually, the ball would probably be blown upwards knowing the winds up there). Now, what does that look like in practice I hear you ask? Enough of this talk of libraries, balls and fell walking (or falling)!

After some nice maths (this time due to Newton and Leibniz who independently developed differentiation) you get the following update rule every time you see a film rating: $$ \mathbf{u}_{i} \leftarrow \mathbf{u}_{i} + \eta \mathbf{v}_j(y_{i,j}-\mathbf{u}_i^\top \mathbf{v}_j) $$ where $\eta$ (which is pronounced eta in English) is a Greek letter representing the learning rate. It's not a very complicated thing, but we can make it seem that way by using a Greek letter (although I teach a lot of Greeks who must really think I've overcomplicated the $\mathbf{u}$ and $\mathbf{v}$ thing).

The arrow just means that the updated value of $\mathbf{u}$ is given by the old value of $\mathbf{u}$ plus the vector associated with the film multiplied by the learning rate and $(y_{i,j}-\mathbf{u}_i^\top \mathbf{v}_j)$. But what is $(y_{i,j}-\mathbf{u}_i^\top \mathbf{v}_j)$? It is the difference between the rating that we want to see and our prediction. The prediction can be just right (baby bear), in which case the difference will be zero and the new value for $\mathbf{u}_i$ will be the same as the old one; or the prediction can be too large (daddy bear), in which case the difference will be negative and $\mathbf{u}_i$ will move in the opposite direction to $\mathbf{v}_j$; or it can be too small (mummy bear), in which case the difference will be positive and $\mathbf{u}_i$ will move towards $\mathbf{v}_j$.

Three situations then. For each film you either have:

  1. Baby bear (stay where you are).
  2. Daddy bear, predicted score too big! (move away from the film).
  3. Mummy bear, predicted score too small! (move towards the film).
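
Before we do the real thing, here's a toy sketch of a single update with made-up numbers, showing the daddy bear case: the prediction is too big, the bracketed term goes negative, and $\mathbf{u}$ moves away from $\mathbf{v}$.


In [ ]:
# a toy single update with made-up numbers: the 'daddy bear' case
learn_rate = 0.1
u = np.array([1., 1.])
v = np.array([2., 2.])
y = 3.                     # the rating we would like to predict
diff = y - np.dot(u, v)    # the prediction is 4, so diff is -1: too big
u = u + learn_rate*v*diff  # u moves in the opposite direction to v
print(u)                   # [0.8, 0.8]: the new prediction, 3.2, is closer to 3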

OK then Goldilocks? What next? Well, that may be enough time thinking about it, time to implement it on the computer and play around a bit!

Steepest Descent Algorithm

In the steepest descent algorithm we aim to minimize the objective function by subtracting the gradient of the objective function from the parameters.

To program this we need to store the parameters in the computer. We are going to store our vectors together in a matrix, which is simply a group of vectors placed together (rather like a vector is a group of numbers placed together). We will call these matrices $\mathbf{U}$ and $\mathbf{V}$.

We need some initial values for them. The code below sets the initial values to small randomly distributed values.

Actually, I told you a little 'white lie' there, we aren't storing the data in just a matrix, but we are again using a pandas data frame. As we saw above, this object doesn't just store the numbers but some of the information about the numbers, which means that rather than just indexing with numbers $i$ and $j$ we can index things with something a bit more human readable. To see this, once we've created the data frames below we can print the starting positions: we should very quickly see the nicknames you used for storing your movie ratings alongside the numbers.

Preprocessing the Data

Very often there are sensible changes we can make to the data to make learning easier. Firstly, we need to convert the movies data frame into something a bit more usable for processing. We'd like the data to be in user/item/rating triplets. And we'd like to throw away all the missing ratings. The lines below 'prepare' the data for processing. They are a bit more complex than some of the other stuff we are doing (and they took me a while to work out! So don't worry if they are confusing, it's probably my poor choice of syntax, although it's not even poetic like perl). At the end we display Y to try and give you the idea of what we've done to the movies data to get Y out.


In [5]:
# first let's extract the names in movies.csv that have rated the data. The first columns are information about the movie

# This command takes the columns 'Film' and 'index' and creates a new 'triplet structure' like the one we mentioned above
Y = pd.melt(movies.reset_index(), id_vars=['title', 'index'], 
            var_name='user', value_name='rating')
            #value_vars=user_names)
# this command renames the column titled 'index' with the title 'item'
Y.rename(columns={'index':'item'}, inplace=True)
# This command removes all the missing values.
Y = Y.dropna(axis=0)
Y[['rating']] = Y[['rating']].astype(int)
Y.item = Y.item.apply(lambda x: int(x[1:]))

In [6]:
q = 2 # the dimension of our map of the 'library'
counter = 0 # a counter for storing how many updates we've done.
user_names = movies.columns[1:]
movie_index = Y.item.unique()
init_u = np.random.normal(size=(len(user_names), q))*0.001
init_v = np.random.normal(size=(len(movie_index), q))*0.001
U = pd.DataFrame(init_u, index=user_names)
V = pd.DataFrame(init_v, index=movie_index)
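
If you want to see those starting positions with the nicknames alongside the numbers (as promised above), you can display the new data frame in a cell of its own.


In [ ]:
U.head() # the first few starting positions: one row per nickname, one column per dimension of the map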

In [7]:
Y


Out[7]:
title item user rating
6 Scream (1996) 288 lawrennd 4
8 Air Force One (1997) 300 lawrennd 4
9 Independence Day (ID4) (1996) 121 lawrennd 3
50 Amadeus (1984) 191 lawrennd 4
69 Fish Called Wanda, A (1988) 153 lawrennd 4
81 Time to Kill, A (1996) 282 lawrennd 3
118 Star Trek IV: The Voyage Home (1986) 230 lawrennd 2
165 Grease (1978) 451 lawrennd 3
174 Pretty Woman (1990) 739 lawrennd 4
195 Fried Green Tomatoes (1991) 660 lawrennd 4
221 Mission: Impossible (1996) 405 spanna 2
222 Fugitive, The (1993) 79 spanna 5
268 Trainspotting (1996) 475 spanna 4
319 Good Will Hunting (1997) 272 spanna 3
403 Return of the Jedi (1983) 181 Smeagol 5
408 Air Force One (1997) 300 Smeagol 3
410 Raiders of the Lost Ark (1981) 174 Smeagol 4
415 Jerry Maguire (1996) 237 Smeagol 3
430 Men in Black (1997) 257 Smeagol 2
463 Broken Arrow (1996) 546 Smeagol 3
477 Dante's Peak (1997) 323 Smeagol 1
485 Hunt for Red October, The (1990) 265 Smeagol 4
512 Eraser (1996) 597 Smeagol 2
514 Beauty and the Beast (1991) 588 Smeagol 5
515 Boot, Das (1981) 515 Smeagol 4
616 Rock, The (1996) 117 filmgeek1988 4
636 L.A. Confidential (1997) 302 filmgeek1988 4
657 Jurassic Park (1993) 82 filmgeek1988 5
673 Die Hard (1988) 144 filmgeek1988 4
674 Casablanca (1942) 483 filmgeek1988 5
... ... ... ... ...
7466 Blues Brothers, The (1980) 186 Schmaylor 4
7468 Trainspotting (1996) 475 Schmaylor 5
7606 Scream (1996) 288 genius 3
7610 Raiders of the Lost Ark (1981) 174 genius 5
7625 Princess Bride, The (1987) 173 genius 5
7630 Men in Black (1997) 257 genius 3
7649 Apollo 13 (1995) 28 genius 4
7656 One Flew Over the Cuckoo's Nest (1975) 357 genius 5
7682 It's a Wonderful Life (1946) 496 genius 5
7705 Starship Troopers (1997) 271 genius 5
7720 Citizen Kane (1941) 134 genius 5
7727 This Is Spinal Tap (1984) 209 genius 5
7736 Unforgiven (1992) 203 genius 5
7753 Fantasia (1940) 432 genius 4
7759 Star Trek III: The Search for Spock (1984) 229 genius 3
7785 Grosse Pointe Blank (1997) 248 genius 4
7809 Independence Day (ID4) (1996) 121 kiramira 3
7818 Star Trek: First Contact (1996) 222 kiramira 3
7846 Shawshank Redemption, The (1994) 64 kiramira 3
7859 2001: A Space Odyssey (1968) 135 kiramira 3
7928 Sabrina (1995) 274 kiramira 3
8032 E.T. the Extra-Terrestrial (1982) 423 checkers 5
8044 When Harry Met Sally... (1989) 216 checkers 4
8045 Aliens (1986) 176 checkers 5
8051 Blade Runner (1982) 89 checkers 4
8054 Usual Suspects, The (1995) 12 checkers 4
8073 Die Hard (1988) 144 checkers 5
8079 Psycho (1960) 185 checkers 3
8111 Shining, The (1980) 200 checkers 5
8135 Taxi Driver (1976) 23 checkers 3

335 rows × 4 columns

If you have time you can have a look at what movies looks like and what Y looks like. If you are familiar with spreadsheets like Excel then try and imagine trying to store both these objects in the same sheet. How do you go from one representation to the other? It might take a lot of operations (i.e. clicking around and copying and pasting). Compare this to the lines above. They take more knowledge to learn, but once you have that knowledge your life is much easier.

This new format makes it much easier to look at the ratings as a whole. For example, we can histogram them, or extract the mean of all the ratings from the rating column.


In [8]:
Y.rating.hist()
plt.title('Histogram of Ratings')


Out[8]:
<matplotlib.text.Text at 0x7faae08673d0>

And it's easy now to subtract the mean from the data.


In [9]:
Y.rating = Y.rating - Y.rating.mean()

In [10]:
d = pods.datasets.movielens100k()
film_info = d['film_info']
Y_lens=d['Y']
Y_lens=Y_lens[Y_lens['item'].isin(Y.item)].copy()
Y_lens.rating = Y_lens.rating - Y_lens.rating.mean() 
user_index = Y_lens.user.unique()
item_index = Y_lens.item.unique()
init_u = np.random.normal(size=(len(user_index), q))*0.001
U_lens = pd.DataFrame(init_u, index=user_index)
V = pd.DataFrame(init_v, index=item_index)


Acquiring resource: movielens100k

Details of data: 
MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies. 
* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set.

Please cite:
Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. Aug. 1999.

After downloading the data will take up 536272 bytes of space.

Data will be stored in /home/maxz/ods_data_cache/movielens100k.

You must also agree to the following license:
Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set.  The data set may be used for any research purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set, and must
       send us an electronic or paper copy of those publications.

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from a faculty member of the GroupLens Research Project at the
       University of Minnesota.

If you have any further questions or comments, please contact GroupLens <grouplens-info@cs.umn.edu>.

Do you wish to proceed with the download? [yes/no]
yes
ml-100k.zip
Downloading  http://files.grouplens.org/datasets/movielens/ml-100k.zip -> /home/maxz/ods_data_cache/movielens100k/ml-100k.zip
[==============================]   4.696/4.696MB
ml-100k-README.txt
Downloading  http://files.grouplens.org/datasets/movielens/ml-100k-README.txt -> /home/maxz/ods_data_cache/movielens100k/ml-100k-README.txt
[==============================]   0.006/0.006MB

Subtracting the mean is a sensible step, because the inner product can represent both negative numbers and positive numbers, yet the ratings we had so far were positive only. Subtracting the mean will cause $\mathbf{u}$ to try and be at a wide angle from $\mathbf{v}$ for the low ratings, because now the target value ($y_{i,j}$) is negative. Sensible steps like this taken before we do our analysis are known as pre-processing the data.

We'll also write a quick function that we can call for computing the objective, just so that we can monitor progress.


In [11]:
def objective(Y_vals, U_vals, V_vals): 
    obj = 0.
    for ind, series in Y_vals.iterrows():
        film = series['item']
        user = series['user']
        rating = series['rating']
        prediction = np.dot(U_vals.loc[user], V_vals.loc[film])  #vTu
        diff = prediction - rating # vTu - y
        obj += diff*diff
    return obj

And let's set up our learning rate, and a counter to store the number of iterations you've done.


In [12]:
learn_rate = 0.05
counter = 0

Optimisation

Now we have the initial values set and the pre-processing done, we can start the optimisation.


In [13]:
epochs = 1
for i in range(epochs):
    # loop across the ratings        
    for ind, series in Y_lens.iterrows():
        # get which film and user the rating is for, and its value
        film = series['item']
        user = series['user']
        y = series['rating']
        # get u and v out from storage.
        u = U_lens.loc[user].values
        v = V.loc[film].values # get the u and v vectors from the matrices.
        # compute the update for u
        diff = (y-np.dot(u,v))#-mu-nu)
        u = u + learn_rate*v*diff 
        # compute the update for v
        v = v + learn_rate*u*diff
        # put u and v back in storage.
        U_lens.loc[user]=u
        V.loc[film]=v

        counter +=1
        if not counter % 10000: # checks if there is no remainder from dividing counter by 10000.
            obj = objective(Y, U, V) #, Mu, Nu)
            print "Update:", counter, "Objective:", obj 
        if counter>30000:
            break


Update: 10000 Objective: 377.325376762
Update: 20000 Objective: 377.325412377
Update: 30000 Objective: 377.32548291

In [14]:
learn_rate = 0.1
counter = 0

In [15]:
import sys
epochs = 30
for i in range(epochs):
    # loop across the ratings        
    for ind, series in Y.iterrows():
        # get which film and user the rating is for, and its value
        film = series['item']
        user = series['user']
        y = series['rating']
        # get u and v out from storage.
        u = U.loc[user].values
        v = V.loc[film].values # get the u and v vectors from the matrices.
        # compute the update for u
        diff = (y-np.dot(u,v))#-mu-nu)
        u = u + learn_rate*v*diff 
        # compute the update for v
        v = v + learn_rate*u*diff
        # put u and v back in storage.
        U.loc[user]=u
        V.loc[film]=v

        counter +=1
        if not counter % 1000: # checks if there is no remainder from dividing counter by 1000.
            obj = objective(Y, U, V) #, Mu, Nu)
            print "Update:", counter, "Objective:", obj


Update: 1000 Objective: 377.30079382
Update: 2000 Objective: 376.826705483
Update: 3000 Objective: 366.468796916
Update: 4000 Objective: 299.820412811
Update: 5000 Objective: 199.491258046
Update: 6000 Objective: 135.437986824
Update: 7000 Objective: 95.0088250136
Update: 8000 Objective: 69.8400069516
Update: 9000 Objective: 57.188132999
Update: 10000 Objective: 49.1487765629

Plotting the Result

So now we have learned those parameters, let's plot the map of where you are and where the movies are in the library!


In [16]:
V


Out[16]:
0 1
1 0.183076 0.189615
7 -0.561889 -0.910444
8 -0.750350 0.646612
11 -0.298702 -1.006292
13 1.190287 -0.249183
15 -1.089392 -0.539742
22 0.528807 -0.918695
25 0.220105 0.253632
28 -0.923326 -0.323664
50 0.541759 -0.446159
71 0.214103 0.105616
77 -0.777510 -0.395414
79 0.162071 0.935833
83 -0.786773 -0.388716
89 0.667328 -0.475148
95 -0.385357 -0.721846
99 -0.849732 -0.806150
124 -0.762526 0.288493
127 -0.997096 0.160759
133 0.823060 0.352506
135 0.845122 0.813553
144 0.860058 -0.014109
153 1.201314 -0.488023
172 -1.373266 -0.012550
173 -0.737328 0.493552
176 1.057559 -0.034322
179 0.792977 0.114314
181 -1.110744 0.699923
182 0.782992 0.171123
187 -1.012451 -0.064854
... ... ...
690 0.213227 0.080885
479 0.813733 -0.169723
511 -0.254329 -0.125884
582 -0.287145 0.023956
24 -0.366711 0.049370
196 -0.675826 0.028841
393 -0.299633 -0.108062
423 0.485487 0.090516
508 -0.672849 -0.343811
117 0.013734 0.138576
188 0.858462 0.090509
209 -0.619404 0.545979
218 0.205800 0.188982
307 0.103603 0.314980
313 -0.312871 0.978437
323 -0.953880 -1.319440
411 0.011786 -0.817824
471 -0.335378 -0.302101
475 -0.711656 0.136600
515 0.137971 0.193580
550 -0.358279 -0.279781
732 0.336907 -0.130924
739 -0.277026 0.371131
751 0.069441 0.270111
118 -0.747108 -0.379176
248 -0.288034 0.096965
252 -1.488702 0.556678
476 -0.059711 0.176887
154 0.335405 0.519721
685 -1.214512 0.487666

155 rows × 2 columns


In [18]:
# create a figure axis
fig, axes = plt.subplots(figsize=(8,8))
plt.rcParams['text.usetex'] = False

# plot the movies, and label with your names
axes.plot(U[0], U[1], 'rx')
for name in U.index:
    axes.text(U[0][name], U[1][name], name)

# create a new figure axis    
fig, axes = plt.subplots(figsize=(8,8))

axes.plot(V[0], V[1], 'rx')
for index in V.index:
    # display the movie name if it was rated more than twice.
    if np.sum(Y['item']==index)>2:
        axes.text(V[0][index], V[1][index], movies['title']['q'+str(index)])



In [19]:
movies['title']['q'+str(index)]


Out[19]:
'Executive Decision (1996)'

Making Predictions

Predictions can be made from the model of the appropriate rating for a given user, $i$, for a given film, $j$, by simply taking the inner product between their vectors $\mathbf{u}_i$ and $\mathbf{v}_j$.
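
As a sketch of what that looks like in the notebook, here's the prediction for one user/film pair that appears in the ratings table above (the nickname 'lawrennd' and item 288, Scream (1996)). Remember that we subtracted the mean rating earlier, so the number predicts how far above (positive) or below (negative) the average the rating would be.


In [ ]:
# a prediction is just the inner product between a user vector and an item vector
# ('lawrennd' and item 288 both appear in the ratings table above)
print(np.dot(U.loc['lawrennd'], V.loc[288]))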

Is Our Map Enough? Are Our Data Enough?

Are two dimensions really enough to capture the complexity of humans and their artforms? Is that how shallow we are? On second thoughts, don't answer that. We would certainly like to think that we need more than two dimensions to capture our complexity.

Let's extend our books and libraries analogy further: consider how we should place books that have a historical timeframe as well as some geographical location. Do we really want books from the 2nd World War to sit alongside books from the Roman Empire? Books on the American invasion of Sicily in 1943 are perhaps less related to books about Carthage than those books that study the Jewish Revolt of 66-70 (in the Roman Province of Judaea---more History!). So books that relate to subjects which are closer in time should probably be stored together. However, a student of 'rebellion against empire' may also be interested in the relationship between the Jewish Revolt of 66-70 and the Indian Rebellion of 1857 (against the British Empire), nearly 1800 years later (they might also like the Star Wars movies ...). Whilst the technologies involved in these revolts would be different, they still involve people (who we argued could all be summarised by sets of numbers) and the psychology of those people is shared: a rebellious nation against their imperial masters, triggered by misrule with a religious and cultural background.

To capture such nuances we would need further dimensions in our latent representation. But are further dimensions justified by the amount of data we have? Can we really understand the facets of a film that only has at most three or four ratings? One answer is to collect more data to justify extra dimensions. If you have got this far, maybe you'd like to 'fight on' against the imperial misrule of data, and consider the movielens 100k data below: many more ratings from many more users!

Going Further: Play Some More

If you want to take this model further then you'll need more data. One possible source of data is the movielens data set. They have data sets containing up to ten million movie ratings. The few ratings we were able to collect in the class are not enough to capture the rich structure underlying these films. Imagine if we assume that the ratings are uniformly distributed between 1 and 5. If you know something about information theory then you could use that to work out the maximum number of bits of information we could gain per rating.
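
As a rough sketch of that calculation: if each of the five ratings were equally likely, a single rating could carry at most $\log_2 5$ bits of information, which the line below works out.


In [ ]:
# the maximum information per rating if all five values were equally likely
print(np.log2(5)) # about 2.32 bits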

Now we'll download the movielens 100k data and see if we can extract information about these movies.


In [ ]:


In [20]:
Y_lens.item


Out[20]:
index
874965758      1
875071561      7
875072484      8
875072262     11
875071805     13
875071608     15
875072404     22
875071805     25
875072173     28
874965954     50
876892425     71
876893205     77
875072865     79
875072370     83
875072484     89
875072303     95
875072547     99
875071484    124
874965706    127
876892818    133
875072404    135
875073180    144
876893230    153
874965478    172
878541803    173
876892468    176
875072370    179
874965739    181
875072520    182
874965678    187
            ... 
886832337    250
886831374    313
886831415    333
876762200    258
876768813    300
891356390    181
891350008    300
891357137    476
880260956    222
880260636    326
880261610    597
885921577      8
885921577     56
885921953    153
885921451    172
885921310    181
885921953    194
875049144      1
875049038    222
875049038    273
875048887    298
875048495    300
891283517     31
891282396    269
891282396    313
891282931    511
888639093     12
888692413    237
875502042    471
875502042    685
Name: item, dtype: int64

Can you do stochastic gradient descent on this data and make the movie map? Let's start with the preprocessing.


In [ ]:

Now let's create the data frames for U and V.


In [ ]:
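# one possible sketch (other choices would work too): give the movielens users and items
# small random starting positions, just as we did for the class ratings above.
user_index_lens = Y_lens.user.unique()
item_index_lens = Y_lens.item.unique()
U_lens = pd.DataFrame(np.random.normal(size=(len(user_index_lens), q))*0.001,
                      index=user_index_lens)
V_lens = pd.DataFrame(np.random.normal(size=(len(item_index_lens), q))*0.001,
                      index=item_index_lens)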

Once again let's set the learning rate, and reset the counter.


In [21]:
learn_rate = 0.01
counter = 0

Now if you have a lot of time you can run it across the movie lens data!


In [22]:
Y_lens


Out[22]:
user item rating split
index
874965758 1 1 1.212109 u1.base
875071561 1 7 0.212109 u1.base
875072484 1 8 -2.787891 u1.base
875072262 1 11 -1.787891 u1.base
875071805 1 13 1.212109 u1.base
875071608 1 15 1.212109 u1.base
875072404 1 22 0.212109 u1.base
875071805 1 25 0.212109 u1.base
875072173 1 28 0.212109 u1.base
874965954 1 50 1.212109 u1.base
876892425 1 71 -0.787891 u1.base
876893205 1 77 0.212109 u1.base
875072865 1 79 0.212109 u1.base
875072370 1 83 -0.787891 u1.base
875072484 1 89 1.212109 u1.base
875072303 1 95 0.212109 u1.base
875072547 1 99 -0.787891 u1.base
875071484 1 124 1.212109 u1.base
874965706 1 127 1.212109 u1.base
876892818 1 133 0.212109 u1.base
875072404 1 135 0.212109 u1.base
875073180 1 144 0.212109 u1.base
876893230 1 153 -0.787891 u1.base
874965478 1 172 1.212109 u1.base
878541803 1 173 1.212109 u1.base
876892468 1 176 1.212109 u1.base
875072370 1 179 -0.787891 u1.base
874965739 1 181 1.212109 u1.base
875072520 1 182 0.212109 u1.base
874965678 1 187 0.212109 u1.base
... ... ... ... ...
886832337 936 250 1.212109 ub.test
886831374 936 313 0.212109 ub.test
886831415 936 333 -0.787891 ub.test
876762200 937 258 0.212109 ub.test
876768813 937 300 0.212109 ub.test
891356390 938 181 1.212109 ub.test
891350008 938 300 -0.787891 ub.test
891357137 938 476 0.212109 ub.test
880260956 939 222 1.212109 ub.test
880260636 939 326 1.212109 ub.test
880261610 939 597 0.212109 ub.test
885921577 940 8 1.212109 ub.test
885921577 940 56 1.212109 ub.test
885921953 940 153 -1.787891 ub.test
885921451 940 172 0.212109 ub.test
885921310 940 181 -0.787891 ub.test
885921953 940 194 1.212109 ub.test
875049144 941 1 1.212109 ub.test
875049038 941 222 -1.787891 ub.test
875049038 941 273 -0.787891 ub.test
875048887 941 298 1.212109 ub.test
875048495 941 300 0.212109 ub.test
891283517 942 31 1.212109 ub.test
891282396 942 269 -1.787891 ub.test
891282396 942 313 -0.787891 ub.test
891282931 942 511 0.212109 ub.test
888639093 943 12 1.212109 ub.test
888692413 943 237 0.212109 ub.test
875502042 943 471 1.212109 ub.test
875502042 943 685 0.212109 ub.test

266259 rows × 4 columns


In [ ]:
# warning, this will take a lot longer than the example above!
# if you start it accidentally and want it to stop, use Kernel->Interrupt from the menu above.
import sys
epochs = 5
for i in range(epochs):
    # loop across the ratings
    for ind, series in Y_lens.iterrows():
        # get which film and user the rating is for, and its value
        film = series['item']
        user = series['user']
        y = series['rating']
        # get u and v out from storage
        u = U_lens.loc[user].values
        v = V_lens.loc[film].values 
        # compute the update for u
        diff = (y-np.dot(u,v)) # the difference between the rating and the prediction
        u = u + learn_rate*v*diff
        # compute the update for v
        v = v + learn_rate*u*diff
        # put u and v back in storage.
        U_lens.loc[user]=u
        V_lens.loc[film]=v
        counter +=1
        if not counter % 100000: # checks if there is no remainder from dividing counter by 100000.
            # let's not compute the entire objective, it would take too long. Select every 4000th sample.
            obj = objective(Y_lens.loc[Y_lens.index[::4000]], U_lens, V_lens)
            print "Update:", counter, "Objective:", obj

Now let's try plotting the new map of the movielens movies.


In [ ]:
# create a new figure axis    
fig, axes = plt.subplots(figsize=(12,12))

for index in V_lens.index:
    # plot the movie name if it was rated more than 2000 times.
    if np.sum(Y_lens['item']==index)>2000:
        axes.plot(V_lens[0][index], V_lens[1][index], 'rx')
        axes.text(V_lens[0][index], V_lens[1][index], film_info['title'][index])

Doing this at Home

The great thing about data science is that there is nothing special about the University environment. All the tools we have used here are freely available, which means you can do all this at home! There are some web pages that help with installing python, or there are some commercial packages which pull everything together for you, such as Enthought Canopy and Anaconda.

The movielens 100k example above was pretty slow, one way we can try and make bigger moves with each iteration is to add momentum. Do you think you can manage that? There is a description of what momentum is from a University level CS course here.
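
If you'd like a starting point, here's one possible sketch (not the only way to do it) of adding momentum to the updates: keep a running 'velocity' for each vector so that previous update directions carry through to the next step. It assumes you've created the U_lens and V_lens data frames above.


In [ ]:
# a possible sketch of stochastic gradient descent with momentum (assumes U_lens and V_lens exist)
learn_rate = 0.01
momentum = 0.9
# velocities start at zero: one row per user vector and one per item vector
U_vel = pd.DataFrame(np.zeros(U_lens.shape), index=U_lens.index)
V_vel = pd.DataFrame(np.zeros(V_lens.shape), index=V_lens.index)
for ind, series in Y_lens.iterrows():
    film = series['item']
    user = series['user']
    y = series['rating']
    u = U_lens.loc[user].values
    v = V_lens.loc[film].values
    diff = y - np.dot(u, v)
    # blend the previous update direction with the new gradient step
    u_step = momentum*U_vel.loc[user].values + learn_rate*v*diff
    v_step = momentum*V_vel.loc[film].values + learn_rate*u*diff
    U_lens.loc[user] = u + u_step
    V_lens.loc[film] = v + v_step
    # remember the steps for next time
    U_vel.loc[user] = u_step
    V_vel.loc[film] = v_step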


In [ ]: